\[~\]

\[~\]

Introduction

\[~\]

Nowadays, with the improvement of the technologies there are an amount of data that allows us to understand if there is any problem or not in the healthy of a person.

Here in this project we explain and we try to predict whether a patient should be diagnosed with Heart Disease or not.

\[~\]

The Dataset

\[~\]

\[~\]

Kaggle, is the main platform where I found this interesting dataset where applying our main bayesian inference as the main scope of this project.

This dataset is composed by these features:

## 
## -- Column specification --------------------------------------------------------
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trtbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalachh = col_double(),
##   exng = col_double(),
##   oldpeak = col_double(),
##   slp = col_double(),
##   caa = col_double(),
##   thall = col_double(),
##   output = col_double()
## )
## # A tibble: 6 x 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3    145   233     1       0      150     0     2.3     0
## 2    37     1     2    130   250     0       1      187     0     3.5     0
## 3    41     0     1    130   204     0       0      172     0     1.4     2
## 4    56     1     1    120   236     0       1      178     0     0.8     2
## 5    57     0     0    120   354     0       1      163     1     0.6     2
## 6    57     1     0    140   192     0       1      148     0     0.4     1
## # ... with 3 more variables: caa <dbl>, thall <dbl>, output <dbl>

\[~\]

Analyzing these features, I recognized that the:

  1. Age -> it refers to the age of the patient
  2. sex -> it refers if the patient is:
    • male (1)
    • female (0)
  3. cp -> it refers to the chest pain type that could be of four types:
    • 0 is the typical angina
    • 1 is the atypical angina
    • 2 is the non-anginal pain
    • 3 is the asymptomatic type
  4. trestbps -> it refers to the resting blood pressure
  5. chol -> it refers to the serum cholesterol in mg/dl
  6. fbs -> it refers to the fasting blood sugar that if it is larger than 120 mg/dl is represented with:
    • 1 (true)
    • 0 (false)
  7. restecg -> it refers to the resting electrocardiographic results:
    • 0 as Normal
    • 1 having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      1. showing probable or definite left ventricular hypertrophy by Estes’ criteria
  8. thalachh -> it refers to the maximum heart rate achieved
  9. exng -> it refers to the exercise induced angina that could be:
    • 1 (yes)
    • 0 (no)
  10. oldpeak -> it refers to the Previous peak
  11. slp -> it refers to the peak exercise ST segment:
    • 0 as up sloping
    • 1 as flat
    • 2 as down sloping
  12. caa -> it refers to the number of major vessels that goes from 0 to 3
  13. thall -> it refers to the maximum heart rate achieved:
    • 0 as no-data
    • 1 as normal
    • 2 as fixed defect
    • 3 as reversible defect
  14. output -> it refers to the fact to have the heart attack or not

\[~\]

The main features detected are:

\[~\]

##       age             sex               cp             trtbps     
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.42   Mean   :0.6821   Mean   :0.9636   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol            fbs           restecg          thalachh    
##  Min.   :126.0   Min.   :0.000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :240.5   Median :0.000   Median :1.0000   Median :152.5  
##  Mean   :246.5   Mean   :0.149   Mean   :0.5265   Mean   :149.6  
##  3rd Qu.:274.8   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.000   Max.   :2.0000   Max.   :202.0  
##       exng           oldpeak           slp             caa        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3278   Mean   :1.043   Mean   :1.397   Mean   :0.7185  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##      thall           output     
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.000  
##  Median :2.000   Median :1.000  
##  Mean   :2.315   Mean   :0.543  
##  3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.000

\[~\]

The Data Visualization

\[~\]

In this subsection we want to describe graphically the dataset that we are analyzing to have a first taste of what is the better model to use in this case, we want to show below some interesting plots

\[~\]

The visualization using PCA

\[~\]

PCA is a particular tool that allows us to show the behaviour of the patients (treated as points in this space) of our data. We see in this way which are the patients more similar to each other

\[~\]

\[~\]

Its normal to denote that there could be some losses on this representation due to the amount of features that we want to consider.

\[~\]

The Categorical Data

\[~\]

Here, we want to analyze the percentages of each categorical data:

\[~\]

\[~\]

The Quantitative Data

\[~\]

Here, we illustrate the main features of the quantitative data, after have seen the categorical data in the previous section:

\[~\]

hist.and.summary('age', 'Persons Age')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   55.50   54.42   61.00   77.00
hist.and.summary('chol', 'Cholestoral')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   240.5   246.5   274.8   564.0
hist.and.summary('thalachh', 'Maximum Heart Rate Achieved')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.2   152.5   149.6   166.0   202.0
hist.and.summary('trtbps', 'Resting Blood Pressure')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
hist.and.summary('oldpeak', 'Previous Peak')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.800   1.043   1.600   6.200

\[~\]

After this, we consider also the densities considering the case of having the heart attack and not:

\[~\]

dense.chart('age', 'Persons Age')
dense.chart('chol', 'Cholestoral')
dense.chart('thalachh', 'Maximum Heart Rate Achieved')
dense.chart('trtbps', 'Resting Blood Pressure')
dense.chart('oldpeak', 'Previous Peak')

The Correlations

\[~\]

Here, we want to highlight the correlations between the features treated in qualitative, quantitative and without any distinction.

\[~\]

The Goal

\[~\]

The main goal is to leverage the main fully Bayesian analysis, based on understanding if a person has a heart attack or not.

So, the response variable \(Y_i\) (that is the “output” feature in our dataset) is the heart attack \(\in \{0,1\}\) and the predictor variables (\(x_i \in \mathbb{R}^{+}\)) have been chosen using the glm() function, used to fit generalized linear models, as we can see below:

\[~\]

## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x13, family = binomial(), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5751  -0.3868   0.1514   0.5841   2.6239  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.304169   2.577891   1.282 0.199936    
## x1          -0.001469   0.023464  -0.063 0.950062    
## x2          -1.750930   0.468199  -3.740 0.000184 ***
## x3           0.847283   0.185548   4.566 4.96e-06 ***
## x4          -0.020188   0.010386  -1.944 0.051916 .  
## x5          -0.004489   0.003806  -1.179 0.238252    
## x6           0.073463   0.532452   0.138 0.890263    
## x7           0.450607   0.348505   1.293 0.196021    
## x8           0.023134   0.010449   2.214 0.026835 *  
## x9          -0.981017   0.409806  -2.394 0.016672 *  
## x10         -0.523604   0.214467  -2.441 0.014630 *  
## x11          0.589074   0.349864   1.684 0.092236 .  
## x12         -0.826015   0.201922  -4.091 4.30e-05 ***
## x13         -0.887203   0.290730  -3.052 0.002276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 210.35  on 288  degrees of freedom
## AIC: 238.35
## 
## Number of Fisher Scoring iterations: 6

\[~\]

As we can see above, the glm rejects the hypothesis to consider the variables:

… and also the intercept isn’t a good choice to admit it in our model.

then, in summary we have N = 302 observations. Then we decided to consider two models and see the difference on which model at the end is the best model of our analysis.

\[~\]

Preliminary brief definitions

\[~\]

\[~\]

The Second Model

\[~\]

The Cumulative Means

\[~\]

The Approximation Errors

\[~\]

The Predictions

\[~\]

The Correlations

\[~\]

The comparison between these two models

\[~\]

The Conclusions

\[~\]

\[~\]

The References

\[~\]